Abstract: We consider the use of a mobile agent to monitor
stochastic,  transient  events  that  occur  in  discrete  locations  in 
the  environment  with  the  objective  of  maximizing  the  number 
of  event  observations  in  a  balanced  manner.  We  assume  that 
the  events  of  interest  at  each  station  follow  a  stochastic  process 
with  an  initially  unknown  and  station-specific  rate  parameter. 
Consequently, the persistent monitoring problem we address
in this paper is a bandit problem, similar to the canonical
Multi-Armed Bandit problem, in which we are faced with
the inherent trade-off between exploration and exploitation.
We  introduce  a  novel  monitoring  algorithm  with  provable 
guarantees  that  leverages  variance  estimates  to  generate  policies 
capable  of  simultaneously  taking  into  account  the  pertinent 
monitoring  objectives  and  the  balance  between  exploration  and 
exploitation.  We  present  analysis  establishing  lower  bounds  for 
the  performance  of  our  algorithm  measured  with  respect  to 
the  quality  of  the  policies  generated.  We  present  experimental 
results  supporting  our  proposed  algorithm  and  comparing 
its  performance  to  that  of  current  state-of-the-art  monitoring 
algorithms. 

I.  Introduction 

We  consider  the  problem  of  using  a  single  mobile  robot 
to  monitor  stochastic,  transient  events  of  interest  occurring 
at  discrete  locations  in  the  environment.  We  assume  that 
events  at  each  location  follow  a  stochastic  process  with  an 
unknown  rate  that  is  independent  of  other  locations’  rates. 
Since the events are stochastic and transient, their exact
time of occurrence cannot be known a priori. Hence, the
monitoring process requires the robot to visit each location
and remain at that location for some amount of time in
anticipation of events.

An example of a surveillance task involving the monitoring
of different bird species by a documentary maker is
shown in Fig. 1. Additional examples of scenarios following
this  setting  include  robots  patrolling  the  city  in  search  of 
possible  suspicious  activities  or  mobile  sensors  roaming  the 
environment  to  track  wildlife  around  oases  in  the  desert. 
The aforementioned scenarios each outline a persistent monitoring
problem for which we would like to use a single
mobile agent to monitor stochastic events in an information-driven
way. The fact that we cannot concurrently monitor
each location due to limited mobile resources motivates the
need for optimal monitoring policies.


Fig.  1.  A  persistent  monitoring  application  in  which  a  documentary  maker 
would  like  to  monitor  three  different  species  of  birds  appearing  in  three 
discrete,  species-specific  locations.  Bird  sightings  at  each  location  follow  a 
stochastic  process  with  a  rate  that  is  initially  unknown  to  the  documentary 
maker  and  must  be  learned  and  approximated  throughout  the  monitoring 
process.  Given  a  cyclic  path  defining  the  sequence  of  stations  to  visit, 
the  documentary  maker  would  like  to  traverse  this  cyclic  path  repeatedly, 
stopping at each station for an appropriate amount of time to observe the
birds. 

We assume that we are given a cyclic patrolling route and
seek to generate the optimal observation time to be spent
at each location subject to a given optimality criterion [1],
[2]. There may be several competing objectives of interest
in a real-world monitoring scenario. These may include
objectives pertaining to the number of events observed, the
distribution of attention across the stations, the time between
consecutive observations at a station, and the classical trade-off
between exploration and exploitation given the unknown
rates of the stations. In this paper, our overarching
objective is to maximize the number of observations
across all stations in a balanced way while simultaneously
managing the inherent exploration and exploitation trade-off.
We note that this setting can be extended to reasoning
over different trajectories, as shown in [3].

Policy generation is rendered challenging by the fact that
the exact timing of (stochastic) events cannot be predicted
in advance and, further, the event statistics are assumed to be
unknown a priori. These relaxed assumptions are in contrast
to previous problem definitions such as those in [1], [2], [3],
where the statistics of events occurring at different locations,
such as the rate of occurrence, were assumed to be known.
In our case, the relaxation of this assumption results in
the canonical exploration and exploitation problem, as the
robot must simultaneously learn the statistics of events
in the environment and adjust its policy in order to optimize
the pertinent monitoring objective. The trade-off between
exploration and exploitation that we address in this paper
is also faced in the canonical multi-armed bandit problem
[4], [5] and in reinforcement learning [6].

In this paper, we introduce a novel persistent monitoring
algorithm with provable guarantees that quantifies the
uncertainty of our rate approximations and uses it to generate
policies that explicitly account for the
inherent exploration and exploitation trade-off. We present
analysis proving probabilistic error bounds on the accuracy
of rate approximations and the optimality of generated policies
as a function of the number of monitoring cycles.
We  present  simulation  results  that  compare  the  performance 
of  our  algorithm  with  that  of  an  adaptive  strategy  and  a  state- 
of-the-art  monitoring  algorithm  [2]. 

II.  Related  Work 

In  part  due  to  the  ubiquity  of  persistent  monitoring  tasks, 
the  problem  of  persistent  surveillance  has  been  previously 
addressed  with  respect  to  a  variety  of  applications  and 
environments. For instance, in [7] the authors considered persistent
surveillance of discrete locations, such as buildings,
windows, and doors, using a team of autonomous micro-aerial
vehicles (MAVs). While UAVs are predominantly associated
with persistent monitoring tasks, in [8] the authors considered
the generation of monitoring policies for autonomous
underwater vehicles with the objective of facilitating efficient
high-value data collection. Furthermore, in [2] and [9], the
authors  present  different  approaches  to  the  min-max  latency 
walk  problem  in  the  context  of  discrete  stations. 

Persistent monitoring problems for multiple agents have
also been studied with regard to a variety of objectives.
In [10], the authors consider the problem of controlling
multiple agents to minimize an uncertainty metric in the
context of a 1-D mission space. Furthermore, decentralized
approaches to controlling a network of robots for the purpose
of sensory coverage have been investigated in [11], where the
authors presented a control law to drive a network of mobile
robots to an optimal sensing configuration.

In addition to persistent monitoring work in static environments,
the case of dynamic environments has also been
an avenue of interest [12], [13], [14]. Namely, the authors of
[12]  considered  optimal  sensing  in  a  time-changing  Gaussian 
Random  Field  and  proposed  a  new  randomized  path  planning 
algorithm  to  find  the  optimal  infinite  horizon  trajectory. 
In  [13],  the  authors  considered  a  changing  environment 
modeled  as  a  field  which  grows  in  locations  that  are  not  in  the 
range  of  the  robot  and  proposed  a  linear  program  to  generate 
speed  controllers  capable  of  keeping  the  field  bounded. 

Persistent  surveillance  is  inherently  closely  related  to 
sensor  scheduling  [15],  sensor  positioning  [16],  and  coverage 
problems [17]. Thus, previous approaches have considered
the problem of persistent monitoring in the context of a mobile
sensor [18]. For instance, in [19], the authors considered
the  problem  of  finding  shortest  watchman  routes  that  enable 
the  watchman  to  traverse  paths  along  which  every  point  in  a 
given  space  is  visible;  the  authors  showed  that  this  problem 
is  NP-hard  in  general. 


From  the  perspective  of  the  persistent  monitoring  problem 
scenario  that  we  address  in  this  paper,  our  work  can  be  seen 
as  most  similar  to  that  of  [2],  where  the  authors  considered 
the  monitoring  of  stochastic,  transient  events  occurring  in 
discrete  locations  in  the  environment.  The  authors  imposed 
the  relatively  strong  assumption  of  having  exact  and  full 
knowledge  of  the  event  statistics  governing  each  stochastic 
process  at  each  location  prior  to  the  monitoring  process. 
Under this assumption, the authors presented a
provably optimal algorithm that generates the unique optimal
policy maximizing the balance of observations while
minimizing the maximum time between two consecutive
observations at each station [2].

Viewed from the perspective of sequential decision making
under uncertainty, there exist parallels between
the monitoring problem that we consider in this paper and
the canonical problem of prediction with expert advice, where
the best expert is unknown a priori. An even closer
relationship exists between our problem and
the widely studied Multi-Armed Bandit (MAB) problem, in
which a gambler is faced with a row of K slot machines
that each yield a stochastic reward according to a machine-specific
probability distribution with a finite mean that
is initially unknown [4], [5]. The objective is to maximize
the accumulated reward by choosing which machine to play
at each discrete time step so that the expected regret with
respect to the reward accumulated after a finite number of
time steps is minimized.

There exist algorithms with provable regret guarantees
even in the finite-horizon case for both the prediction with
expert advice problem [20] and MAB [4], [5]. Unfortunately,
applying or extending these algorithms to the problem of
persistent surveillance is non-trivial due to salient
differences between the persistent monitoring problem that
we consider and a widely studied bandit problem such as
MAB. Namely, the persistent surveillance problem that we
address in this paper exhibits a continuous state and parameter
space, which is in contrast to MAB, where the gambler
attempts to choose the optimal lever to pull among a finite
set of levers at discrete time steps, i.e., rounds. Furthermore,
the  monitoring  problem  we  consider  allows  traveling  the 
given  cyclic  path  multiple  times.  This  necessitates  additional 
reasoning  for  iteration-dependent  policies  that  consider  the 
trade-off  between  the  cost  (i.e.,  wasted  travel  time  that  could 
otherwise  be  spent  on  observing)  incurred  by  traveling  from 
one  station  to  the  next  and  the  total  time  that  should  be  spent 
in  traversing  each  monitoring  cycle. 

In  contrast  to  all  of  the  aforementioned  prior  work  in 
the  realm  of  persistent  surveillance,  we  present  algorithms 
with  provable  guarantees  for  the  problem  of  monitoring  of 
stochastic  and  transient  events  occurring  in  discrete  stations 
in which the event statistics are unknown a priori. We employ
Bayesian inference to efficiently learn, approximate, and reason
about the event statistics at each station. Our algorithm
explicitly quantifies and considers the uncertainties over our
approximations to generate time-efficient, adaptive policies
which simultaneously achieve near-optimal monitoring objective
values and balance exploration and exploitation.


III.  Problem  Definition 

Let there be $n \in \mathbb{N}_+$ stations, labeled by $i \in [n]$, whose
locations are known. Events of interest occur at each station
$i$ and follow a Poisson process with an unknown rate
parameter, denoted by $\lambda_i$, where the rate for each station
is independent of other stations' rates. We assume that the
stations are spatially distributed in the domain and hence the
robot must spend a non-zero travel time $c_{i,j} \in \mathbb{R}_+$ as it travels
from one station $i$ to another station $j$.

We  assume  that  we  are  given  a  cyclic  path  between 
the  stations  and  our  goal  is  to  generate  a  policy  stating 
the  observation  time  that  the  robot  should  spend  at  each 
station  to  optimize  an  arbitrary  monitoring  objective.  Over  a 
monitoring  period  that  is  presumably  bounded  by  resource 
constraints,  a  robot  may  traverse  the  given  cyclic  path 
multiple times and execute a variety of policies. We formally
define a monitoring cycle as the complete execution of a
monitoring policy and let $k \in \mathbb{N}_+$ index the cycles. Under
this terminology, a policy to be executed at cycle $k$ is
defined as the sequence of observation times per station, i.e.,
$\pi_k := (t_{1,k}, t_{2,k}, \ldots, t_{n,k})$, where $t_{i,k} \in \mathbb{R}_+$ is the time to spend
at each station $i \in [n]$.

Due to the presence of stochastic events, there will be
inevitable variability from the execution of one monitoring
cycle to the next. As will be detailed further in Sec. IV, the
number of events observed, $n_{i,k}$, and the time spent, $t_{i,k}$, at
each station $i$ constitute sufficient statistics to be considered
at each cycle $k$; we let $X_i^k := (n_{i,k}, t_{i,k})$ denote this pair of
values and denote the set of all relevant statistics obtained from
the start of the monitoring task to iteration $k$ by
$X_i^{1:k} := \{X_i^1, X_i^2, \ldots, X_i^k\}$.


As mentioned earlier, the majority of problems that
face the exploration and exploitation trade-off, such as
MAB, define the overarching optimization problem as the
minimization of regret after a finite amount of time. MAB
is concerned with the sole objective of maximizing the
cumulative reward obtained; hence there is only one objective
function (regarding the accumulated reward) to
consider. However, in the persistent monitoring problem
that we address in this paper, the overarching goal is to
simultaneously maximize the number of events observed and
maximize the balance of observations across all stations.
Consequently, defining regret with
respect to multiple objectives is non-trivial.

We instead recast the problem of persistent monitoring
into an optimization problem with respect to individual
monitoring cycles and present an alternative, per-cycle definition
of the optimization problem. Defining the monitoring
problem locally, with respect to individual
monitoring cycles, can be viewed as a heuristic approach for
greedily generating high-quality policies which perform well
in minimizing regret with respect to both objectives.

We formalize our overarching objective functions as follows.
We let $f_{obs}(\pi_k)$ be the objective function regarding the
expected number of observations made across all stations,
i.e.,

$$f_{obs}(\pi_k) := \sum_{i=1}^{n} \mathbb{E}[N_i(\pi_k)] \tag{1}$$

where $\mathbb{E}[N_i(\pi_k)] = \lambda_i t_{i,k}$ by definition of expectation for
a Poisson process with rate $\lambda_i$. In order to reason about
balanced attention, we formalize the notion of observation
balance by letting the function $f_{bal}(\pi_k)$ denote, as in [2], the
expected observations ratio for a given policy $\pi_k$, which we seek to
maximize:


$$f_{bal}(\pi_k) := \min_{i \in [n]} \frac{\mathbb{E}[N_i(\pi_k)]}{\sum_{j=1}^{n} \mathbb{E}[N_j(\pi_k)]} \tag{2}$$
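
As a concrete illustration of the two objectives, the following minimal Python sketch evaluates them for a given policy. It assumes that the rates are known, that $\mathbb{E}[N_i(\pi_k)] = \lambda_i t_{i,k}$, and that the balance objective is the smallest share of expected observations held by any single station; the function names and the numerical values are hypothetical and only serve the example.

```python
from typing import List, Sequence


def expected_counts(rates: Sequence[float], times: Sequence[float]) -> List[float]:
    """Expected number of events per station: E[N_i] = lambda_i * t_i."""
    return [lam * t for lam, t in zip(rates, times)]


def f_obs(rates: Sequence[float], times: Sequence[float]) -> float:
    """Total expected number of observations over one cycle."""
    return sum(expected_counts(rates, times))


def f_bal(rates: Sequence[float], times: Sequence[float]) -> float:
    """Smallest per-station share of the expected observations (assumed form)."""
    counts = expected_counts(rates, times)
    total = sum(counts)
    return min(counts) / total if total > 0 else 0.0


if __name__ == "__main__":
    rates = [2.0, 1.0, 0.5]           # hypothetical true rates
    uniform = [1.0, 1.0, 1.0]         # equal observation times
    balanced = [1.0, 2.0, 4.0]        # times inversely proportional to the rates
    print(f_obs(rates, uniform), f_bal(rates, uniform))
    print(f_obs(rates, balanced), f_bal(rates, balanced))
```

Under this assumed form, the balance value is largest when every station contributes the same expected number of observations, which is why the second policy in the example scores higher on balance than the first.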


Persistent surveillance is traditionally defined as an idealized,
infinite-horizon problem in which the total monitoring time is unbounded.
Intuitively, we expect the agent to execute multiple monitoring
cycles of varying length depending on iteration-specific
policies that consider past history and observations. In light
of a possibly unbounded monitoring time, the two objective
functions defined above do not help establish
an appropriate upper bound on the total time that should be
spent per monitoring cycle.

Rather than imposing an arbitrary bound on the monitoring
time per cycle, we let the bound be a function of the
uncertainty over the rates at each station. Namely, we seek to
establish an adaptive bound on the observation time for each
station in a way that considers the trade-off between travel
cost and the need to execute multiple monitoring cycles so
that each station can be visited more than once. In what
follows, we introduce a class of policy optimization problems
subject to the uncertainty constraint, a hard constraint that
adaptively balances exploration and exploitation by controlling
the decay of uncertainty over time. In short, the premise
of the uncertainty constraint is to induce a rapid decrease
of approximation uncertainty, which enables more accurate
evaluations of prospective policies in the subsequent cycle,
leading to the generation of high-quality policies within a
short amount of time.

More formally, let $V_i : \mathbb{N}_+ \to \mathbb{R}_{\geq 0}$ be a function that
quantifies the uncertainty in our estimate of the rate of each
station $i$ after a certain number of iterations. At each iteration
$k$, having observed the events in the previous
$k-1$ monitoring cycles, we would like to generate a policy $\pi_k$
such that, with high probability, the uncertainty in our approximations
decreases by some factor after executing $\pi_k$. More
formally, for a given $\delta \in (0,1)$ and $\varepsilon \in (0, 1/2)$, each policy $\pi_k$
must satisfy the following uncertainty constraint for all $i \in [n]$:

$$\mathbb{P}\left(V_i(k) \leq \delta V_i(k-1) \mid X_i^{1:k-1}\right) > 1 - \varepsilon. \tag{3}$$


In  light  of  our  monitoring  objectives  and  the  uncertainty 
constraint,  we  formalize  the  per-cycle  optimization  problem 
as  follows. 


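Problem 1 (Per-cycle Monitoring Optimization Problem). In
each iteration $k \in \mathbb{N}_+$, generate a policy $\pi_k^*$ that simultaneously
satisfies the uncertainty constraint (3) and maximizes
the balance of observations, i.e.,

$$\pi_k^* \in \operatorname*{argmax}_{\pi_k} f_{bal}(\pi_k) \tag{4}$$
$$\text{s.t.} \quad \mathbb{P}\left(V_i(k) \leq \delta V_i(k-1) \mid X_i^{1:k-1}\right) > 1 - \varepsilon \quad \forall i \in [n].$$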

The per-cycle problem above defines the optimization
to be solved at each cycle $k \in \mathbb{N}_+$ in order to generate
an appropriate monitoring policy $\pi_k$. By associating the
optimization  problem  with  each  cycle,  we  ensure  that  the 
generated  policies  are  adaptive  to  the  events  that  occur  in 
previous  cycles  and  enable  the  consideration  of  refined  rate 
approximations.  In  the  following  section,  we  introduce  a 
method  for  generating  adaptive  policies  at  each  iteration  that 
are  optimal  with  respect  to  the  optimization  problem  defined 
above. 


After updating, we can employ the posterior distribution
to generate a refined approximation, i.e., a point estimate,
of the rate parameter for station $i$. Since our approximations
will change iteratively, let $(\hat{\lambda}_{i,k})_{k \in \mathbb{N}_+}$ denote the sequence
of approximations for $\lambda_i$ with respect to cycle $k$. For each
cycle $k$ we can leverage the fact that our updated posterior
distribution is $\mathrm{Gamma}(\alpha_{i,0} + \sum_{j=1}^{k} n_{i,j},\ \beta_{i,0} + \sum_{j=1}^{k} t_{i,j})$ and set
our approximation to be the posterior mean, i.e.,

$$\hat{\lambda}_{i,k} := \frac{\alpha_{i,0} + \sum_{j=1}^{k} n_{i,j}}{\beta_{i,0} + \sum_{j=1}^{k} t_{i,j}} = \frac{\alpha_i}{\beta_i},$$

which follows by definition of the Gamma distribution. Once
updated approximations for the rates of all stations are
available, i.e., $\hat{\lambda}_{1,k}, \ldots, \hat{\lambda}_{n,k}$, they can subsequently be used
to evaluate the objective functions described in Sec. III.


IV.  Methods 


In this section, we introduce a novel monitoring algorithm
that generates optimal policies at each monitoring cycle with
respect to the optimization problem defined in Sec. III. We
describe the sub-procedure for learning and approximating
event statistics using Bayesian inference, which enables the
incorporation of a priori knowledge and the generation of
rate approximations for each station. We outline and provide
pseudo-code for generating dynamic, adaptive policies that
appropriately interleave learning and approximating event
statistics (exploration) with generating and executing policies
(exploitation).


A.  Learning  and  Approximating  Event  Statistics 

Prior to the monitoring process, we may have prior beliefs
about what the rate $\lambda_i$ could be for each station $i$. To model
and incorporate any prior knowledge regarding the rate
parameter, we use a Gamma distribution defined by the shape
hyper-parameter $\alpha_i$ and the rate (inverse-scale) hyper-parameter $\beta_i$ as the
conjugate prior for the parameter $\lambda_i$. The hyper-parameters
$\alpha_i$ and $\beta_i$ will be initialized to values representing the prior
beliefs, which we denote by $\alpha_{i,0}$ and
$\beta_{i,0}$, and will then be updated during the monitoring process
to represent our posterior beliefs given observations.

We can obtain the posterior distribution for the rate of any
arbitrary station by updating the hyper-parameters $\alpha_i$ and $\beta_i$.
Given the current values of $\alpha_i$ and $\beta_i$ at cycle $k$, consider
observing $n_{i,k}$ events in $t_{i,k}$ time. Then, our posterior
distribution in light of the observations is defined as:

$$\mathbb{P}(\lambda_i \mid X_i^{1:k}) \propto \mathbb{P}(X_i^k \mid \lambda_i)\, \mathbb{P}(\lambda_i \mid X_i^{1:k-1}) \propto \mathrm{Gamma}(\alpha_i + n_{i,k},\ \beta_i + t_{i,k}), \tag{5}$$

where (5) follows by conjugacy. For any number
$n_{i,k}$ of events observed during $t_{i,k}$ time, the posterior update
procedure simply entails updating the hyper-parameters based
on their previous values and the values of $n_{i,k}$ and $t_{i,k}$, i.e.,
$\alpha_i \leftarrow \alpha_i + n_{i,k}$ and $\beta_i \leftarrow \beta_i + t_{i,k}$ at each cycle. More generally,
after $k$ monitoring cycles our posterior distribution is given
by $\mathrm{Gamma}(\alpha_{i,0} + \sum_{j=1}^{k} n_{i,j},\ \beta_{i,0} + \sum_{j=1}^{k} t_{i,j})$.
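
The conjugate update described above amounts to two additions per visit. The following is a minimal Python sketch of that bookkeeping for a single station, assuming that the second hyper-parameter acts as a rate (inverse scale), which is what the update of the form "add the observation time" implies. The class name `GammaPosterior` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GammaPosterior:
    """Gamma belief over a Poisson rate for one station (illustrative sketch)."""
    alpha: float  # shape hyper-parameter
    beta: float   # rate (inverse-scale) hyper-parameter

    def update(self, n_events: int, obs_time: float) -> None:
        # Conjugate update: add the event count to alpha and the time to beta.
        self.alpha += n_events
        self.beta += obs_time

    @property
    def mean(self) -> float:
        # Posterior mean, usable as the point estimate of the rate.
        return self.alpha / self.beta

    @property
    def variance(self) -> float:
        # Posterior variance, usable as an uncertainty measure.
        return self.alpha / self.beta ** 2


if __name__ == "__main__":
    belief = GammaPosterior(alpha=1.0, beta=1.0)   # hypothetical prior
    for n_events, obs_time in [(3, 2.0), (5, 4.0)]:
        belief.update(n_events, obs_time)
    print(belief.alpha, belief.beta, belief.mean, belief.variance)
```

Because only the running sums of counts and observation times enter the update, the belief after several cycles does not depend on the order in which the cycles were observed.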


B.  Controlling  Approximation  Uncertainty 

We formalize the definitions of the uncertainty function
and the uncertainty constraint (3) introduced in Sec. III and
present a method to generate policies subject to the uncertainty
constraint. The premise of the uncertainty approach
is to enable efficient generation of high-quality policies
by enforcing, with high probability, a controlled and rapid decay
of the uncertainty of our approximations.

Recall from Sec. III that $V_i : \mathbb{N}_+ \to \mathbb{R}_{\geq 0}$ is a function
that quantifies our uncertainty in our estimate of the rate
at station $i$. We choose to formally define the uncertainty
function as the variance of the posterior distribution after $k$
cycles, i.e.,

$$V_i(k) := \mathrm{Var}(\lambda_i \mid X_i^{1:k}) = \frac{\alpha_i}{\beta_i^2} = \frac{\alpha_{i,0} + \sum_{j=1}^{k} n_{i,j}}{\left(\beta_{i,0} + \sum_{j=1}^{k} t_{i,j}\right)^2}. \tag{6}$$


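Under this setting, for a given policy $\pi_k$ at cycle $k \in \mathbb{N}_+$, the
uncertainty constraint as defined in Sec. III is equivalent to:

$$\mathbb{P}\left(\mathrm{Var}(\lambda_i \mid X_i^{1:k}) \leq \delta\, \mathrm{Var}(\lambda_i \mid X_i^{1:k-1}) \mid X_i^{1:k-1}\right) > 1 - \varepsilon$$

for all stations $i \in [n]$. We further simplify the uncertainty
constraint by employing the definition of posterior variance
and obtain

$$\mathbb{P}\left(\frac{\alpha_i + N_{i,k}(t_{i,k})}{(\beta_i + t_{i,k})^2} \leq \delta\, \frac{\alpha_i}{\beta_i^2} \,\Big|\, X_i^{1:k-1}\right) \tag{7}$$
$$= \mathbb{P}\left(N_{i,k}(t_{i,k}) \leq K(t_{i,k}) \mid X_i^{1:k-1}\right) > 1 - \varepsilon, \tag{8}$$

where $N_{i,k}(t_{i,k}) \sim \mathrm{Poisson}(\lambda_i t_{i,k})$ and $K(t_{i,k}) := \delta\, \frac{\alpha_i}{\beta_i^2} (\beta_i + t_{i,k})^2 - \alpha_i$.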
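
The step from the variance inequality to a threshold on the event count can be written out explicitly. The following is a sketch, assuming the posterior variance $\alpha_i/\beta_i^2$ of the Gamma distribution and the per-cycle update that adds the observed count to $\alpha_i$ and the observation time to $\beta_i$:

```latex
\begin{align*}
\frac{\alpha_i + N_{i,k}(t_{i,k})}{(\beta_i + t_{i,k})^2}
  \le \delta\,\frac{\alpha_i}{\beta_i^2}
\quad\Longleftrightarrow\quad
N_{i,k}(t_{i,k})
  \le \delta\,\frac{\alpha_i}{\beta_i^2}\,(\beta_i + t_{i,k})^2 - \alpha_i .
\end{align*}
```

The right-hand side is quadratic in the observation time, whereas the mean of the Poisson count grows only linearly. At zero observation time the right-hand side equals $(\delta - 1)\alpha_i < 0$, so the constraint cannot hold without observing, and it becomes easier to satisfy as the observation time grows.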

Generating an appropriate $t_{i,k}$ that satisfies the inequality
given by (8) above requires that we reason about the possible
values that the random variable $N_{i,k}(t_{i,k})$ can assume. Hence,
in order to make a more informed decision in generating the
observation time $t_{i,k}$, we leverage the notion of a credible
interval in our policy generation process. Namely, given a
fixed $\varepsilon \in (0, 1/2)$ and past observations at station $i$, we
construct a credible interval for $\lambda_i$, denoted by the open interval
$(\lambda_i^l, \lambda_i^u)$, such that

$$\forall \lambda_i \in \mathbb{R}_+ \quad \mathbb{P}\left(\lambda_i \in (\lambda_i^l, \lambda_i^u) \mid X_i^{1:k-1}\right) = 1 - \varepsilon$$

for the rate parameter $\lambda_i$ of each station $i$. In constructing
the credible interval, we reason about the regularized
Gamma function, denoted by $Q(a,s)$, due to its inherent
relationship with the cumulative distribution function of
the posterior Gamma distribution. Namely, we employ the
inverse of the regularized Gamma function with respect to
the second variable, $Q^{-1}(a,s)$, to generate an equal-tailed
credible interval as follows:

$$\lambda_i^l := \frac{Q^{-1}(\alpha_i, 1 - \frac{\varepsilon}{2})}{\beta_i}, \qquad \lambda_i^u := \frac{Q^{-1}(\alpha_i, \frac{\varepsilon}{2})}{\beta_i}.$$

In  addition,  we  have  the  property  that 

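Since we have constructed an equal-tailed credible interval
with respect to $\varepsilon \in (0, 1/2)$, the following holds by definition:

$$\forall \lambda_i \in \mathbb{R}_+ \quad \mathbb{P}(\lambda_i^l > \lambda_i \mid X_i^{1:k-1}) = \mathbb{P}(\lambda_i^u < \lambda_i \mid X_i^{1:k-1}) = \frac{\varepsilon}{2}.$$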
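
To make the equal-tailed construction concrete, the following Python sketch computes the two end points numerically using only the standard library. It is an illustration under stated assumptions rather than the procedure used in the paper: the posterior is taken to be a Gamma distribution with shape `shape` and rate `rate`, its cumulative distribution function is evaluated with the power series of the regularized lower incomplete gamma function, and the quantiles are found by bisection instead of an inverse regularized Gamma routine. All function names are hypothetical.

```python
import math
from typing import Tuple


def gamma_cdf(x: float, shape: float, rate: float) -> float:
    """CDF of Gamma(shape, rate) at x via the series for P(shape, rate * x)."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    term = 1.0 / shape          # n = 0 term of sum_n z^n / (a (a+1) ... (a+n))
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10_000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def gamma_quantile(p: float, shape: float, rate: float) -> float:
    """Smallest x with CDF(x) >= p, located by bisection."""
    lo, hi = 0.0, shape / rate
    while gamma_cdf(hi, shape, rate) < p:   # grow the bracket until it covers p
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def credible_interval(shape: float, rate: float, eps: float) -> Tuple[float, float]:
    """Equal-tailed interval: probability eps / 2 is left in each tail."""
    return (gamma_quantile(eps / 2.0, shape, rate),
            gamma_quantile(1.0 - eps / 2.0, shape, rate))


if __name__ == "__main__":
    lower, upper = credible_interval(shape=9.0, rate=7.0, eps=0.1)
    print(lower, upper)
```

The series converges for every positive argument but loses accuracy when the product of rate and argument becomes very large, so a dedicated special-function library is preferable in practice.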

Now, putting it all together, given the observations $X_i^{1:k-1}$
after having executed $k-1$ cycles and the end points of
the credible interval, $\lambda_i^l$ and $\lambda_i^u$, generating an optimal
observation time $t^*_{i,k}$ for each station $i$ entails efficiently
generating an observation time that satisfies the inequality
given by (8). A notable observation in this context is the
existence of uncountably many values of $t_{i,k}$ that satisfy
this inequality. This follows from the fact that the expression
$K(t_{i,k})$ is quadratic in $t_{i,k}$,
whereas the distribution's parameter, i.e., $\lambda_i t_{i,k}$, is linear
in $t_{i,k}$. Recalling that achieving low latency, i.e., monitoring
cycles with short observation times, is preferable for a
monitoring task, we pick the minimum $t_{i,k}$ possible.

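In other words, the optimal $t^*_{i,k}$ is defined as the optimal
solution to the following optimization:

$$\inf_{t_{i,k} \in \mathbb{R}_+} t_{i,k} \quad \text{s.t.} \quad \mathbb{P}\left(N_{i,k}(t_{i,k}) \leq K(t_{i,k}) \mid X_i^{1:k-1}\right) > 1 - \varepsilon.$$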
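
For intuition about this optimization, the following Python sketch finds the smallest feasible observation time on a grid by evaluating the Poisson cumulative distribution function exactly. This brute-force search is a stand-in for illustration only and is not the approximation-based method of the paper. It assumes that the count threshold is $\delta \alpha_i (\beta_i + t)^2 / \beta_i^2 - \alpha_i$ and that a fixed, pessimistic rate value (for example the upper end point of the credible interval) is supplied as `rate`; the function names, the grid step, and the sample numbers are hypothetical.

```python
import math
from typing import Optional


def threshold(alpha: float, beta: float, delta: float, t: float) -> float:
    """Largest event count for which the posterior variance still shrinks by delta."""
    return delta * alpha * (beta + t) ** 2 / beta ** 2 - alpha


def poisson_cdf(k: int, mean: float) -> float:
    """P(N <= k) for N ~ Poisson(mean), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for j in range(1, k + 1):
        term *= mean / j
        total += term
    return min(1.0, total)


def satisfies_constraint(alpha: float, beta: float, rate: float,
                         delta: float, eps: float, t: float) -> bool:
    """Check P(N(t) <= threshold(t)) > 1 - eps with N(t) ~ Poisson(rate * t)."""
    k = math.floor(threshold(alpha, beta, delta, t))
    return poisson_cdf(k, rate * t) > 1.0 - eps


def min_observation_time(alpha: float, beta: float, rate: float, delta: float,
                         eps: float, step: float = 0.01,
                         t_max: float = 100.0) -> Optional[float]:
    """First grid point (multiples of step) that satisfies the constraint."""
    for m in range(1, int(round(t_max / step)) + 1):
        t = m * step
        if satisfies_constraint(alpha, beta, rate, delta, eps, t):
            return t
    return None  # no feasible time found within t_max


if __name__ == "__main__":
    print(min_observation_time(alpha=4.0, beta=2.0, rate=3.0, delta=0.9, eps=0.1))
```

The cost of this search grows with the grid resolution and with the size of the threshold, which illustrates why a closed-form approximation of the cumulative distribution function is attractive when the computation has to be repeated for every station in every cycle.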


Unfortunately, due to the inherent complexity of the cumulative
distribution function of a Poisson random variable,
using a non-linear optimization method to generate the optimal
solution is computationally expensive. Hence,
instead of computing the quantile
function exactly, we employ a sharp inequality provided by [21],
which improves upon the Chernoff-Hoeffding inequalities
by a factor of at least two, to approximate the cumulative
distribution function.

As demonstrated rigorously in the analysis section (Sec. V),
the use of this approximation combined with the observation
regarding the quadratic-linear relationship of $K(t_{i,k})$
and $\lambda_i t_{i,k}$ results in a simplified solution for $t^*_{i,k}$. Namely, an
appropriate choice of $t^*_{i,k}$ is given by


tlk  :=  f  e . 


where $H(m,k)$ is the Kullback-Leibler (KL) divergence between
two Poisson distributed random variables with means
$m$ and $k$, and $W$ is the Lambert W function. An appropriate
value for $t^*_{i,k}$ can be efficiently obtained by invoking a root-finding
algorithm such as Brent's method on the equation above.

The constant factor $\delta \in (0, 1)$ controls the rapidness of
uncertainty decay. There exists a trade-off between low values
of $\delta$, which lead to lengthy, but also risky policies, and
high values of $\delta$, which lead to shorter, but less efficient
policies due to incurred travel time. Thus, in order to pick
an appropriate value of $\delta$, we use the following generalized
sigmoid function that takes into account the number of
stations and the total travel time per cycle:

g(«):=gmin+ 

where  5mm,Smax  G  (0,1), Ti,:=  (L"r/Q,,+i  -f  e  K+ 

are  the  lower  asymptote,  the  upper  asymptote,  and  the 
growth  rate  respectively,  where  Smin  =  1/^;  ^max  =  0.99,  S,-  = 
(Ztl  Ci,i+i  +  Cn.i)  ^  are  the  lower  asymptote,  the  upper 
asymptote, and the growth rate respectively. It is worth noting
that $\delta$ is a static variable and is initialized only once at the
beginning of the monitoring process.


C.  Generating  Balanced  Policies  that  Consider  Approxima¬ 
tion  Uncertainty 

We extend the method defined in the previous section
so that the generated policy simultaneously satisfies the
uncertainty constraint and balances attention given to all
stations in the minimum time possible. The key insight
behind our approach is that if we first compute a policy
$\pi_{low} := (t_1^{low}, \ldots, t_n^{low})$, where each $t_i^{low}$ is defined by the
expression for $t^*_{i,k}$ given in the previous section, then $\pi_{low}$ acts
as a lower bound on each of the
observation times. In other words, any observation time $t_i$ for
a particular station $i$ that is higher than $t_i^{low}$ given by $\pi_{low}$ is
ensured to satisfy the uncertainty constraint by monotonicity, as
described in Sec. V. Now, we can initially set $\pi_k := \pi_{low}$ to
ensure that $\pi_k$ and any policy with higher observation times
satisfies the uncertainty constraint.

In addition to satisfying the uncertainty constraint on
the observation times, we must also satisfy the balance
constraint, i.e., maximize the objective function $f_{bal}$. The idea is
to generate a new policy $\pi_k^*$ by increasing the observation
times of $\pi_k$ so that $\pi_k^*$ satisfies the balance and uncertainty
constraints. Note that $\pi_k^*$ achieves the optimal balance value
if and only if

$$\mathbb{E}[N_1(\pi_k^*)] = \mathbb{E}[N_2(\pi_k^*)] = \cdots = \mathbb{E}[N_n(\pi_k^*)].$$


For the initial lower bound policy $\pi_k = \pi_{low} = (t_1^{low}, \ldots, t_n^{low})$
from the expression in the previous section, it may very well
be the case that the above equality does not hold. However,
$\pi_k$ can be modified by first looking at the maximum number
of expected events that needs to be matched, i.e.,
$N_{\max} := \max_{i \in [n]} \hat{\lambda}_{i,k}\, t_i^{low}$.

Now, using $N_{\max}$ we can increase the observation times in
$\pi_k$ to generate the true optimal policy $\pi_k^* = (t^*_{1,k}, \ldots, t^*_{n,k})$. The
generation procedure for each observation time is as follows:

$$t^*_{i,k} := \frac{N_{\max}}{\hat{\lambda}_{i,k}} = \frac{\beta_i}{\alpha_i} N_{\max}.$$


Our policy-generating function, which summarizes the sequence of steps described above, is shown in Alg. 2. The entirety of our monitoring algorithm, which employs Alg. 2 as a subprocedure to generate policies, is shown as Alg. 1.
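The balancing step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the example rates, and the example lower-bound times are hypothetical, and the balance objective is assumed to be the smallest share of expected observations across stations (so that its optimum is 1/n).

```python
def balance_policy(rate_estimates, t_low):
    """Raise lower-bound observation times so expected counts match.

    rate_estimates: current rate approximations, one per station.
    t_low: lower-bound observation times satisfying the uncertainty constraint.
    Both inputs are illustrative assumptions, not values from the paper.
    """
    if len(rate_estimates) != len(t_low):
        raise ValueError("one rate estimate and one lower bound per station")
    # N_max: largest expected event count under the lower-bound policy.
    n_max = max(lam * t for lam, t in zip(rate_estimates, t_low))
    # t_i* = N_max / lambda_hat_i equalizes expected counts across stations.
    return n_max, [n_max / lam for lam in rate_estimates]


def balance_value(rates, times):
    """Assumed objective: min_i E[N_i] / sum_j E[N_j] (1/n when balanced)."""
    expected = [lam * t for lam, t in zip(rates, times)]
    return min(expected) / sum(expected)


rate_estimates = [2.0, 0.5, 1.0]   # hypothetical rate approximations
t_low = [3.0, 4.0, 5.0]            # hypothetical lower-bound times
n_max, t_star = balance_policy(rate_estimates, t_low)
```

Because every observation time is only ever increased from its lower bound, the balanced policy keeps whatever guarantee the lower-bound policy had, which is the monotonicity argument used in the text.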


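Algorithm 1 Executes the entirety of the monitoring process and employs Algorithm 2 as a sub-procedure to generate policies.

Input:
    $\alpha_{i,0}, \beta_{i,0}$: prior parameters for each station $i$
    $c_{i,j}$: travel times for all pairs of stations $i, j$

$\alpha_i \leftarrow \alpha_{i,0}$
$\beta_i \leftarrow \beta_{i,0}$
$T_{tr} \leftarrow \sum_{i=1}^{n-1} c_{i,i+1} + c_{n,1}$
while NotDoneMonitoring() do
    for $i \in [n]$ do
        $\hat{\lambda}_i \leftarrow \alpha_i / \beta_i$
    $\pi^* \leftarrow$ Algorithm 2($\alpha$, $\beta$, $T_{tr}$, $\epsilon$)
    // Monitoring loop
    for $i \in [n]$ do
        // Observe events for time $t_i^*$
        $N_{\mathrm{observed}} \leftarrow$ ObserveEvents($t_i^*$)
        // Update hyper-parameters of the posterior
        $\alpha_i \leftarrow \alpha_i + N_{\mathrm{observed}}$
        $\beta_i \leftarrow \beta_i + t_i^*$
return $\pi^* = (t_1^*, \ldots, t_n^*)$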
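The hyper-parameter updates in Algorithm 1 correspond to the standard conjugate Gamma-Poisson update. The sketch below illustrates that update for a single station; the class name and the numerical values are hypothetical, and the Gamma(shape, rate) parameterization is an assumption consistent with the rate approximation being the ratio of the two hyper-parameters.

```python
from dataclasses import dataclass


@dataclass
class StationPosterior:
    """Gamma(alpha, beta) belief over one station's event rate (assumed form)."""

    alpha: float  # shape hyper-parameter
    beta: float   # rate hyper-parameter (accumulated observation time)

    def update(self, n_observed: int, obs_time: float) -> None:
        # Conjugate update: add the observed count and the observation time.
        self.alpha += n_observed
        self.beta += obs_time

    @property
    def rate_estimate(self) -> float:
        # Posterior mean, used as the rate approximation.
        return self.alpha / self.beta

    @property
    def variance(self) -> float:
        # Posterior variance of a Gamma(alpha, beta) distribution.
        return self.alpha / self.beta ** 2


station = StationPosterior(alpha=4.0, beta=2.0)   # hypothetical prior
prior_estimate = station.rate_estimate
prior_variance = station.variance
station.update(n_observed=6, obs_time=3.0)        # hypothetical observation
```

In this example the observed count matches the prior rate exactly, so the estimate is unchanged while the variance shrinks, which is the effect the uncertainty constraint is designed to control.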


V.  Analysis 


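Algorithm 2 Generates an optimal policy with respect to optimization problem (1) defined in Sec. III

Input:
    $\alpha, \beta$: hyper-parameters of the posterior
    $T_{tr}$: total travel time for the given cycle
    $\epsilon \in (0, 1/2)$: user-given input
Output:
    $\pi^* = (t_1^*, \ldots, t_n^*)$: optimal policy

// Calculate the lower bound for $\delta$
$\delta_{\min} \leftarrow$ [...]
$\delta \leftarrow \delta_{\min}$
// Upper end-point of the $1 - \epsilon$ credible interval
$\lambda_u \leftarrow$ [...]
for $i \in [n]$ do
    // Define the function $K_i(t)$
    $K_i(t) := \delta \frac{\alpha_i}{\beta_i^2} (\beta_i + t)^2 - \alpha_i$
    $t_i^{\mathrm{low}} \leftarrow t \in \mathbb{R}_+$ such that [...] $= 0$
// Calculate the maximum $\mathbb{E}[N_i(\pi_{\mathrm{low}})]$
$N_{\max} \leftarrow \max_{i \in [n]} \frac{\alpha_i}{\beta_i}\, t_i^{\mathrm{low}}$
// Balance attention based on $N_{\max}$
for $i \in [n]$ do
    $t_i^* \leftarrow N_{\max} \frac{\beta_i}{\alpha_i}$
return $\pi^*$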


In this section, we present analysis proving that each iteration-specific policy generated by our algorithm described in Sec. IV is an optimal solution to the optimization problem defined in Sec. III with respect to the generated rate approximations $\hat{\lambda}_{i,k}$, $i \in [n]$, at each iteration $k \in \mathbb{N}_+$. Subsequently, we establish guarantees on the posterior variance and absolute error of our rate approximations as a function of optimization iterations. We conclude by employing the aforementioned properties to establish a bound on the quality of our generated solutions with respect to those generated by an Oracle Algorithm that is assumed to have perfect knowledge of the ground-truth rates.

We begin by showing that at every monitoring iteration $k \in \mathbb{N}_+$, the policy defined by the sequence of observation times, each generated according to the expression presented in Sec. IV, is optimal with respect to the rate approximations.

Lemma 1 (Satisfaction of the Uncertainty Constraint). For a given $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$ at iteration $k \in \mathbb{N}$, a value of $t_{i,k}^*$ given by Eq. (9) satisfies the uncertainty constraint of Sec. III.

Lemma 2 (Optimality of Generated Observation Times). For all $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$ at iteration $k \in \mathbb{N}$, an optimal observation time for each station with respect to optimization problem (1) presented in Sec. III is given by

$$t_{i,k}^* := \frac{N_{\max}}{\hat{\lambda}_{i,k}} = N_{\max}\,\frac{\beta_i}{\alpha_i},$$

where $N_{\max} := \max_{i \in [n]} \hat{\lambda}_{i,k}\, t_{i,k}^{\mathrm{low}}$ and $t_{i,k}^{\mathrm{low}}$ is given by the expression presented in Sec. IV.

In light of the appropriateness of our choices for the observation time, we can establish further guarantees that pertain to the posterior variance and the error of our approximations.


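Lemma 3 (Bound on Posterior Variance). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, the posterior variance $\mathrm{Var}(\lambda_i \mid X_i^{(1:k)})$ is bounded above by $\delta^k \mathrm{Var}(\lambda_i)$ with probability at least $(1 - \epsilon)^k$, i.e.,

$$\mathbb{P}\big(\mathrm{Var}(\lambda_i \mid X_i^{(1:k)}) \le \delta^k \mathrm{Var}(\lambda_i) \,\big|\, X_i^{(1:k)}\big) \ge (1 - \epsilon)^k$$

for all stations $i \in [n]$, where $\mathrm{Var}(\lambda_i) := \alpha_{i,0} / \beta_{i,0}^2$ is the prior variance.

Corollary 4 (Bound on Approximation Variance). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, the variance of our approximation $\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)})$ is bounded above by $\delta^{k-1} \mathrm{Var}(\lambda_i)$ with probability greater than $(1 - \epsilon)^{k-1}$, i.e.,

$$\mathbb{P}\big(\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)}) \le \delta^{k-1} \mathrm{Var}(\lambda_i) \,\big|\, X_i^{(1:k-1)}\big) > (1 - \epsilon)^{k-1}$$

for all stations $i \in [n]$.

Theorem 5 ($\xi$-Bound on the Approximation Error). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, for any $\xi \in \mathbb{R}_+$, our approximation $\hat{\lambda}_{i,k}$ lies within a ball of radius $\xi$ centered at $\lambda_i$ with probability at least $(1 - \epsilon)^{k-1} \big(1 - \delta^{k-1} \mathrm{Var}(\lambda_i) / \xi^2\big)$, i.e.,

$$\mathbb{P}\big(|\hat{\lambda}_{i,k} - \lambda_i| < \xi \,\big|\, X_i^{(1:k-1)}\big) \ge (1 - \epsilon)^{k-1} \left(1 - \frac{\delta^{k-1} \mathrm{Var}(\lambda_i)}{\xi^2}\right)$$

for all $i \in [n]$.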
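To get a feel for the guarantees in Lemma 3 and Theorem 5, the bounds can be evaluated numerically. The following sketch simply plugs values into the two expressions; the function names and the chosen parameter values are hypothetical, and clipping the probability bound at zero is an added convenience since the raw expression can be negative for small radii.

```python
def variance_bound(prior_var, delta, k):
    """Upper bound delta^k * Var(lambda_i) on the posterior variance."""
    return delta ** k * prior_var


def error_probability_bound(prior_var, delta, eps, k, xi):
    """Lower bound (1 - eps)^(k-1) * (1 - delta^(k-1) * Var / xi^2).

    Clipped at 0 (an assumption of this sketch) because a negative
    probability bound carries no information.
    """
    chebyshev_term = 1.0 - delta ** (k - 1) * prior_var / xi ** 2
    return max(0.0, (1.0 - eps) ** (k - 1) * chebyshev_term)


# Hypothetical values: prior variance 1, delta = 0.5, eps = 0.1, radius 0.5.
var_after_4 = variance_bound(prior_var=1.0, delta=0.5, k=4)
prob_after_4 = error_probability_bound(
    prior_var=1.0, delta=0.5, eps=0.1, k=4, xi=0.5
)
```

Note that the two factors pull in opposite directions as the number of iterations grows: the Chebyshev term approaches one while the factor involving the confidence parameter decays, so the bound is most informative when that parameter is small.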

Theorem  6  (A-Bound  on  Policy  Optimality).  For  any  i®,  G 
M+,  i  G  [n],  given  that  0  <  — A,  |  <  with  probability  as 

given  in  Theorem^  let  Gmin  :=  LJLi (A;  —  and  Omax  '■= 
I^!=i(A( +‘?()  *•  Then,  the  objective  value  of  our  policy 
at  iteration  k  is  within  a  factor  of  A  of  the  ground-truth 
optimal  solution,  where  A  ;=  with  probability  greater 

than  (1  -  e)«('^-l)  (1  - 


Fig. 2. We present results conveying the quality of the statistics approximations as a function of monitoring time. The rapid decrease in approximation error achieved by our algorithm (cyan) supports the conjecture that our algorithm is able to generate adaptive policies conducive to an accelerated rate of error decrease, in contrast to the other algorithms.

VI.  Results 

In this section, we present results that portray the performance of our algorithm in a monitoring scenario and contrast the quality of policies generated by our algorithm with that of a current state-of-the-art algorithm for persistent surveillance [2] and a dynamic algorithm that represents a naive method of generating adaptive policies. We consider the results of the experiments in two settings: (i) a synthetic simulation scenario in which the events are generated according to a Poisson distribution and (ii) a real-world inspired scenario, denoted as the yellow backpack scenario, simulated in ARMA, a tactical military game.

Fig. 3. Additional results of the performance of our algorithm with respect to the Mean Squared Error (MSE) metric as a function of monitoring time. Similar to the results shown in Fig. 2, we note that our algorithm (cyan) is able to obtain accurate approximations for event statistics quickly, resulting in smaller MSE values than those of the other monitoring algorithms (red and blue).

The  synthetic  simulation  framework  and  the  monitoring 
algorithm  were  implemented  in  Python.  The  experiments 
were  conducted  on  a  MacBook  Pro  with  one  3.1  GHz  Intel 
Core  i7  (4  cores  total)  processor  and  16  GB  of  RAM.  In 
what  follows,  we  present  the  experimental  scenarios  and  the 
respective  results. 

A.  Synthetic  Simulation  Results 

We obtained results from 10,000 trials (per algorithm) of a simulated persistent monitoring scenario involving the monitoring of events in 3 discrete stations for a monitoring period of 10 hours (600 minutes). The settings for the environment and the ground-truth rates were randomly generated by drawing random variables from the following distributions for each of the three stations:

1) Prior hyper-parameter $\alpha_{i,0} \sim \mathrm{Uniform}(1, 20)$

2) Prior hyper-parameter $\beta_{i,0} \sim \mathrm{Uniform}(0.5, 1)$

3)  Rate  parameter  A,-  ~  Uniform(^^,  -^)  events  per 

4)  Cost  of  travel  from  station  i  to  an  adjacent  station  j 
Cij  ~  Uniform (2, 5)  minutes  of  travel  time. 

In this synthetic simulation, the arrival times of the random events were specifically drawn from a Poisson distribution.
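A synthetic environment of this kind can be sketched as follows. This is an illustrative reading of the setup, not the authors' simulation framework: it interprets Poisson-distributed events as a homogeneous Poisson process generated from exponential inter-arrival gaps, and the function names, the seed, and the rate of 2 events per minute are hypothetical placeholders.

```python
import random


def sample_station_prior(rng):
    """Draw the prior hyper-parameters from the ranges listed above."""
    return {
        "alpha0": rng.uniform(1.0, 20.0),
        "beta0": rng.uniform(0.5, 1.0),
    }


def simulate_arrivals(rate, duration, rng):
    """Arrival times of a homogeneous Poisson process on [0, duration).

    Successive gaps are exponential with the given rate, so the number of
    arrivals in any window is Poisson distributed.
    """
    arrivals = []
    t = rng.expovariate(rate)
    while t < duration:
        arrivals.append(t)
        t += rng.expovariate(rate)
    return arrivals


rng = random.Random(0)                       # fixed seed for repeatability
station = sample_station_prior(rng)
arrivals = simulate_arrivals(rate=2.0, duration=600.0, rng=rng)  # assumed rate
```

Counting the arrivals that fall inside an observation window then plays the role of the event-observation step in the monitoring loop.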

To ensure consistency and compare the algorithms in a fair manner, we incorporated the same learning and approximation procedure detailed in Sec. III for the other two algorithms. This integration enabled us to measure the performance of an algorithm that operated under the assumption of known rates prior to the monitoring procedure [2].

[Figure; x-axis: Algorithm (Bal. Events, Min Delay | Incremental Search, Bal. Events | Bal. Events, Min $\sigma^2$ (Ours))]

Fig. 4. The performance of our monitoring algorithms with respect to the objective function pertaining to the balance of observations. We can see that when the balance of observations is considered with respect to the entire 10 hour monitoring window, the policies generated by our algorithm achieve a significantly higher objective value than those generated by the other two algorithms.

[Figure; x-axis: Algorithm (Bal. Events, Min Delay | Incremental Search, Bal. Events | Bal. Events, Min $\sigma^2$ (Ours))]

Fig. 5. The performance of each algorithm with respect to the objective pertaining to the number of events observed within the allotted monitoring time (10 hours). We note that our algorithm (cyan) enables the agent to observe significantly more events.

The label and description of each algorithm, along with its corresponding color in the figures, are as follows:

1) Bal. Events, Min. Delay (Red): the algorithm introduced by [2] which, as mentioned previously, assumes that the event statistics are available a priori.

2) Incremental Search, Bal. Events (Dark Blue): an algorithm that acknowledges the presence of the exploration/exploitation trade-off and attempts to generate adaptive and lengthier policies. The algorithm initially begins with a random upper bound on the total cycle time. After each monitoring iteration, the algorithm increases the upper bound monotonically by a small random amount (with an expected increase of 5 minutes) and generates observation times that balance expected observations subject to this arbitrary upper bound on the total cycle time.

3) Bal. Events, Min $\sigma^2$ (Cyan): our algorithm introduced in this paper, which employs variance estimates to simultaneously generate policies and balance the exploration/exploitation trade-off in a near-optimal way.

We show plots of relative approximation error as a function of time, the total number of events observed on average after 10 hours of monitoring, the balance of observations with respect to all of the observations made in the 10 hour monitoring period, and the total computation time spent generating policies during the execution of a trial. As expected, the results show that our algorithm, shown in cyan in all figures, outperforms the other two evaluated algorithms with respect to every metric. Namely, from the figures we can see that our algorithm efficiently generates balanced policies capable of achieving near-optimal monitoring objective values while simultaneously inducing a rapid decline of approximation uncertainty.

B.  ARMA  Simulation  Results 

To be completed: in this subsection, we present the results regarding the yellow backpack scenario: a real-world inspired monitoring application involving the surveillance of people wearing yellow backpacks in an ARMA simulation (see Figs. 8 and 9).

VII.  Conclusion 

In this paper we introduced novel algorithms and objective criteria for the task of persistent monitoring of events with statistics that are unknown a priori. Our algorithms bridged previous literature and tools pertaining to persistent surveillance and machine learning in order to simultaneously explore and exploit the environment with respect to a given monitoring objective. Namely, our algorithms considered maximizing the number of observations across all stations in a balanced manner while simultaneously ensuring the controlled decay of uncertainty in our rate approximations. We presented analysis showing the favorable properties of our algorithm with regard to uncertainty and policy optimality. We performed computational experiments with a diverse environment in terms of event statistics and compared our monitoring approach to the state-of-the-art. In future work we intend to relax the assumptions imposed on the events further and extend our work to dynamic and large-scale environments.

Fig. 6. We present results conveying the decaying behavior of the posterior variance as a function of time in a scenario involving 3 stations. We note that while the variance is monotonically decreasing in expectation for all algorithms compared, our algorithm (cyan) enables a much more rapid decay of the variance.

[Figure; x-axis: Algorithm (Bal. Events, Min Delay | Incremental Search, Bal. Events | Bal. Events, Min $\sigma^2$ (Ours))]

Fig. 7. We present results showing the computation time required to generate the policies in the simulated scenario. We can see that our algorithm (cyan) is able to generate high-quality policies with higher computational efficiency when compared to the other state-of-the-art algorithms.

Fig. 8. Work in progress: a rendition of the yellow backpack scenario simulated in ARMA. In this scenario, agents randomly wander around a town-like environment that includes buildings and apartments. The persistent monitoring task for a robot, such as a UAV, is to continuously survey the environment by following a given circular patrolling cycle, observing for an appropriate amount of time at particular locations, and detecting potentially malicious people wearing yellow backpacks during the observation process.

Fig. 9. Work in progress: a viewpoint of the yellow backpack simulation in ARMA different from that of Fig. 8 is shown. Three potentially suspicious agents carrying yellow backpacks are seen near a building located at the intersection of two streets.
VIII. Appendix
A. Proofs of Results Presented in Sec. V

1) Proof of Lemma 1

Lemma 1 (Satisfaction of the Uncertainty Constraint). For a given $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$ at iteration $k \in \mathbb{N}$, a value of $t_{i,k}^*$ given by Eq. (9) satisfies the uncertainty constraint of Sec. III.

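Proof. We first show that the proposed value of $t_{i,k}^*$ satisfies the uncertainty condition for all stations $i \in [n]$. Recall from Sec. III that the uncertainty constraint is equivalent to the following:

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}\big) \ge 1 - \epsilon \qquad (10)$$

with $N_i(t_{i,k}^*) \sim \mathrm{Poisson}(\lambda_i t_{i,k}^*)$ and $K(t_{i,k}^*) := \delta \frac{\alpha_i}{\beta_i^2} (\beta_i + t_{i,k}^*)^2 - \alpha_i$.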
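The equivalence between the variance-reduction requirement and a threshold on the observed event count can be checked directly. The sketch below assumes the conjugate Gamma update from Algorithm 1 (the count is added to the shape parameter and the observation time to the rate parameter) and derives the threshold by rearranging the requirement that the posterior variance be at most a fraction of its previous value; the function names and numbers are hypothetical.

```python
def posterior_variance(alpha, beta, n_events, obs_time):
    """Gamma posterior variance after seeing n_events in obs_time."""
    return (alpha + n_events) / (beta + obs_time) ** 2


def event_threshold(alpha, beta, delta, obs_time):
    """Largest count K(t) for which the variance shrinks by a factor delta.

    Rearranging (alpha + N) / (beta + t)^2 <= delta * alpha / beta^2 gives
    N <= delta * alpha * (beta + t)^2 / beta^2 - alpha.
    """
    return delta * alpha * (beta + obs_time) ** 2 / beta ** 2 - alpha


# Hypothetical hyper-parameters and observation time.
alpha, beta, delta, obs_time = 4.0, 2.0, 0.5, 3.0
threshold = event_threshold(alpha, beta, delta, obs_time)
target_variance = delta * alpha / beta ** 2
```

With these values, observing 8 events keeps the posterior variance below the target while observing 9 does not, which is why the constraint can be phrased as a bound on a Poisson count.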

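Now, we can employ the credible interval established in Alg. 2 to further simplify the left-hand side of (10):

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}\big) \qquad (11)$$

$$= \int_0^{\infty} \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda\big)\, \mathbb{P}\big(\lambda \,\big|\, X_i^{(1:k-1)}\big)\, d\lambda$$

$$\ge \int_0^{\lambda_u} \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda\big)\, \mathbb{P}\big(\lambda \,\big|\, X_i^{(1:k-1)}\big)\, d\lambda$$

$$\ge \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda_u\big) \int_0^{\lambda_u} \mathbb{P}\big(\lambda \,\big|\, X_i^{(1:k-1)}\big)\, d\lambda$$

$$= \left(1 - \frac{\epsilon}{2}\right) \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda_u\big) \qquad (12)$$

where we utilized the generated credible interval for $\lambda_i$ and the fact that

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda_u\big) = \inf_{\lambda \in (0, \lambda_u]} \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda\big)$$

to establish the inequalities.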

We can further simplify the expression in (12) by establishing a lower bound for the cumulative distribution function of a Poisson random variable with mean $m(t_{i,k}^*) = \lambda_u t_{i,k}^*$, evaluated at the value $K(t_{i,k}^*)$. Using the inequality established by [21], we have that the following holds for $k > m$:

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le k\big) \ge 1 - \frac{e^{-H(m,k)}}{\max\big\{2, \sqrt{4 \pi H(m,k)}\big\}} \qquad (13)$$

where $m := \mathbb{E}[N_i(t_{i,k}^*) \mid \lambda_u] = \lambda_u t_{i,k}^*$, $k = K(t_{i,k}^*)$, and $H(m,k)$ is the Kullback-Leibler (KL) divergence between two Poisson distributed random variables with means $m$ and $k$, defined as

$$H(m,k) := m - k + k \ln\frac{k}{m}.$$
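The tail bound in (13) is easy to compare against the exact Poisson cumulative distribution function. The following sketch implements the KL divergence and the resulting lower bound using only the standard library; the helper names and the test values of the mean and threshold are hypothetical.

```python
import math


def poisson_kl(m, k):
    """H(m, k) = m - k + k * ln(k / m), for means m, k > 0."""
    return m - k + k * math.log(k / m)


def cdf_lower_bound(m, k):
    """Lower bound on P(N <= k) for N ~ Poisson(m), valid for k > m."""
    h = poisson_kl(m, k)
    return 1.0 - math.exp(-h) / max(2.0, math.sqrt(4.0 * math.pi * h))


def poisson_cdf(m, k):
    """Exact P(N <= k) for N ~ Poisson(m), summing terms up to floor(k)."""
    term = math.exp(-m)          # P(N = 0)
    total = term
    for j in range(1, math.floor(k) + 1):
        term *= m / j            # P(N = j) from P(N = j - 1)
        total += term
    return total


# Hypothetical (mean, threshold) pairs with threshold above the mean.
pairs = [(10.0, 20.0), (5.0, 6.0), (3.0, 12.0), (50.0, 60.0)]
gaps = [poisson_cdf(m, k) - cdf_lower_bound(m, k) for m, k in pairs]
```

In each of these cases the exact probability sits above the bound, and the gap narrows as the threshold moves further into the tail.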


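We note that by definition of $t_{i,k}^*$, $H(m(t_{i,k}^*), K(t_{i,k}^*)) = H^*$, and thus

$$1 - \frac{e^{-H^*}}{\max\big\{2, \sqrt{4 \pi H^*}\big\}} = 1 - \frac{\epsilon}{2 - \epsilon}.$$

Continuing from (12) in light of this inequality yields

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}\big) \ge \left(1 - \frac{\epsilon}{2}\right) \mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}, \lambda_u\big) \ge \left(1 - \frac{\epsilon}{2}\right)\left(1 - \frac{\epsilon}{2 - \epsilon}\right) = 1 - \epsilon.$$

Putting it all together, we have for our choice of $t_{i,k}^*$ given by (9) that for any $\epsilon \in (0, 1/2)$

$$\mathbb{P}\big(N_i(t_{i,k}^*) \le K(t_{i,k}^*) \,\big|\, X_i^{(1:k-1)}\big) \ge 1 - \epsilon.$$

□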
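For completeness, the final simplification in the proof above, in which the credible-interval factor and the Poisson tail factor multiply to the target confidence level, can be verified by a short calculation (assuming, as in the proof, that the two factors are $1 - \epsilon/2$ and $1 - \epsilon/(2 - \epsilon)$):

```latex
\begin{align*}
\left(1 - \frac{\epsilon}{2}\right)\left(1 - \frac{\epsilon}{2 - \epsilon}\right)
  &= \frac{2 - \epsilon}{2} \cdot \frac{2 - 2\epsilon}{2 - \epsilon} \\
  &= \frac{2 - 2\epsilon}{2} \\
  &= 1 - \epsilon .
\end{align*}
```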


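2) Proof of Lemma 2

Lemma 2 (Optimality of Generated Observation Times). For all $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$ at iteration $k \in \mathbb{N}$, an optimal observation time for each station with respect to optimization problem (1) presented in Sec. III is given by

$$t_{i,k}^* := \frac{N_{\max}}{\hat{\lambda}_{i,k}} = N_{\max}\,\frac{\beta_i}{\alpha_i},$$

where $N_{\max} := \max_{i \in [n]} \hat{\lambda}_{i,k}\, t_{i,k}^{\mathrm{low}}$ and $t_{i,k}^{\mathrm{low}}$ is given by the expression presented in Sec. IV.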

Proof. We argue by contradiction. Suppose that there exists some $t_{i,k}^*$ that is not the optimal solution to the optimization problem. This implies that $t_{i,k}^*$ either (i) violates the uncertainty constraint or (ii) induces an unbalanced observation scheme.

We immediately see that case (i) leads to a contradiction, since $t_{i,k}^*$ is defined to be bounded below by $t_{i,k}^{\mathrm{low}}$, the solution given by the expression presented in Sec. IV; hence, by monotonicity of the uncertainty condition, any value greater than or equal to $t_{i,k}^{\mathrm{low}}$ also satisfies the inequality given by the uncertainty constraint. Similarly, we note that (ii) also leads to a contradiction and thus cannot occur, since by definition of each $t_{i,k}^*$ we have

$$\hat{\lambda}_{1,k}\, t_{1,k}^* = N_{\max}, \quad \hat{\lambda}_{2,k}\, t_{2,k}^* = N_{\max}, \quad \ldots, \quad \hat{\lambda}_{n,k}\, t_{n,k}^* = N_{\max},$$

which implies that $\pi_k^* = (t_{1,k}^*, \ldots, t_{n,k}^*)$ maximizes balance (i.e., the objective function $f_{\mathrm{bal}}$):

$$\mathbb{E}[N_1(\pi_k^*)] = \mathbb{E}[N_2(\pi_k^*)] = \cdots = \mathbb{E}[N_n(\pi_k^*)] \iff \pi_k^* \in \operatorname{argmax}_{\pi_k} f_{\mathrm{bal}}(\pi_k).$$

Hence, (ii) leads to a contradiction. Since we have exhausted all the cases of sub-optimality, it must be the case that for all stations $i \in [n]$ and all iterations $k \in \mathbb{N}$, the value of $t_{i,k}^*$ is optimal, implying that the policy $\pi_k^* = (t_{1,k}^*, \ldots, t_{n,k}^*)$ is optimal with respect to the per-cycle optimization problem. □

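3) Proof of Lemma 3

Lemma 3 (Bound on Posterior Variance). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, the posterior variance $\mathrm{Var}(\lambda_i \mid X_i^{(1:k)})$ is bounded above by $\delta^k \mathrm{Var}(\lambda_i)$ with probability at least $(1 - \epsilon)^k$, i.e.,

$$\mathbb{P}\big(\mathrm{Var}(\lambda_i \mid X_i^{(1:k)}) \le \delta^k \mathrm{Var}(\lambda_i) \,\big|\, X_i^{(1:k)}\big) \ge (1 - \epsilon)^k$$

for all stations $i \in [n]$, where $\mathrm{Var}(\lambda_i) := \alpha_{i,0} / \beta_{i,0}^2$ is the prior variance.

Proof. From Lemma 1, we have that each $t_{i,k}^*$ is ensured to satisfy the uncertainty condition, i.e., for all $i \in [n]$,

$$\mathbb{P}\big(\mathrm{Var}(\lambda_i \mid X_i^{(1:k)}) \le \delta\, \mathrm{Var}(\lambda_i \mid X_i^{(1:k-1)}) \,\big|\, X_i^{(1:k-1)}\big) \ge 1 - \epsilon \qquad (14)$$

for each iteration $k$ regardless of the events that transpire in the other iterations. Hence, the probability of satisfying this condition for $k$ consecutive iterations is at least $(1 - \epsilon)^k$. This implies that, with probability at least $(1 - \epsilon)^k$, the following chain of inequalities holds:

$$\mathrm{Var}(\lambda_i \mid X_i^{(1)}) \le \delta\, \mathrm{Var}(\lambda_i),$$

$$\mathrm{Var}(\lambda_i \mid X_i^{(1:2)}) \le \delta\, \mathrm{Var}(\lambda_i \mid X_i^{(1)}) \le \delta^2\, \mathrm{Var}(\lambda_i),$$

$$\vdots$$

$$\mathrm{Var}(\lambda_i \mid X_i^{(1:k)}) \le \delta\, \mathrm{Var}(\lambda_i \mid X_i^{(1:k-1)}) \le \delta^k\, \mathrm{Var}(\lambda_i).$$

□

4) Proof of Corollary 4

Corollary 4 (Bound on Approximation Variance). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, the variance of our approximation $\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)})$ is bounded above by $\delta^{k-1} \mathrm{Var}(\lambda_i)$ with probability greater than $(1 - \epsilon)^{k-1}$, i.e.,

$$\mathbb{P}\big(\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)}) \le \delta^{k-1} \mathrm{Var}(\lambda_i) \,\big|\, X_i^{(1:k-1)}\big) > (1 - \epsilon)^{k-1}$$

for all stations $i \in [n]$.

Proof. Employing the law of total conditional variance, we have for each $i \in [n]$

$$\mathrm{Var}(\lambda_i \mid X_i^{(1:k-1)}) = \mathbb{E}\big[\mathrm{Var}(\lambda_i \mid X_i^{(1:k)})\big] + \mathrm{Var}\big(\mathbb{E}[\lambda_i \mid X_i^{(1:k)}] \,\big|\, X_i^{(1:k-1)}\big)$$

$$= \mathbb{E}\big[\mathrm{Var}(\lambda_i \mid X_i^{(1:k)})\big] + \mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)})$$

$$\ge \mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)}).$$

Invoking Lemma 3, we have that $\mathrm{Var}(\lambda_i \mid X_i^{(1:k-1)}) \le \delta^{k-1} \mathrm{Var}(\lambda_i)$ with probability greater than $(1 - \epsilon)^{k-1}$. Combining this inequality with the above application of the law of total conditional variance yields the result. □

5) Proof of Theorem 5

Theorem 5 ($\xi$-Bound on the Approximation Error). For any $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, after $k \in \mathbb{N}_+$ iterations, for any $\xi \in \mathbb{R}_+$, our approximation $\hat{\lambda}_{i,k}$ lies within a ball of radius $\xi$ centered at $\lambda_i$ with probability at least $(1 - \epsilon)^{k-1} \big(1 - \delta^{k-1} \mathrm{Var}(\lambda_i) / \xi^2\big)$, i.e.,

$$\mathbb{P}\big(|\hat{\lambda}_{i,k} - \lambda_i| < \xi \,\big|\, X_i^{(1:k-1)}\big) \ge (1 - \epsilon)^{k-1} \left(1 - \frac{\delta^{k-1} \mathrm{Var}(\lambda_i)}{\xi^2}\right)$$

for all $i \in [n]$.

Proof. Note that Chebyshev's inequality states the following for all $i \in [n]$:

$$\mathbb{P}\big(|\hat{\lambda}_{i,k} - \lambda_i| < \xi \,\big|\, X_i^{(1:k-1)}\big) \ge 1 - \frac{\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)})}{\xi^2}.$$

In light of Corollary 4, we have that $\mathrm{Var}(\hat{\lambda}_{i,k} \mid X_i^{(1:k-1)}) \le \delta^{k-1} \mathrm{Var}(\lambda_i)$ with probability greater than $(1 - \epsilon)^{k-1}$. Employing this inequality and Chebyshev's inequality yields

$$\mathbb{P}\big(|\hat{\lambda}_{i,k} - \lambda_i| < \xi \,\big|\, X_i^{(1:k-1)}\big) \ge (1 - \epsilon)^{k-1} \left(1 - \frac{\delta^{k-1} \mathrm{Var}(\lambda_i)}{\xi^2}\right).$$

□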


6) Proof of Theorem 6

Theorem  6  (A-Bound  on  Policy  Optimality).  For  any  G 
K+,  i  G  [n],  given  that  0  <  |A,'^jt  — A,  |  <  with  probability  as 
given  in  Theorem^  let  a,nin  '■='L"=\{h-  and  Umax  '■= 
*•  Then,  the  objective  value  of  our  policy 
at  iteration  k  is  within  a  factor  of  A  of  the  ground-truth 
optimal  solution,  where  A  :=  with  probability  greater 

than  (1  -  e)«(<^-l)  (1  - 

Proof.  Let  T  —  Y!i=\i*k  observation  time  allo¬ 

cated  by  the  generated  policy.  Then,  by  the  optimality  of 
policy  nl  =  {t\kT--fnk)  respect  to  the  rate  approxi¬ 
mations,  we  have  the  following  equalities 

~  A^maxj  ^24^24  ~  ^maX5  ■  ■  ■  ~  ^max- 


which  implies  that 


Vi  G  [n] 


f*  . 

i,k 


T 


1 

X.k 


Now  recall  that  the  objective  function  pertaining  to  balance 
0  is  given  by: 


/bai(?tt:)  :=min 


E[Vi(7r*:)] 

r,=inNM)Y 


for  all  i  G  [«]. 


and  the  optimal  (maximal)  value  of  this  function  is  ^  Now, 
using  the  fact  that  |A,  ,t  — A,j  <  i*,-,  we  have  the  following 


inequalities  for 


T 


mm,- 


I". 


l-Uihk)-^ 

T 

I.U{kk)-' 

T 


> 


LU(M+^i)- 

tiT 


_  Y.Uih-^iV 

_  1  /  ^min  \ 

^  f^max 


with  probability  at  least  (1  —  e)”(^  (l  —  - — 


□ 


